Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

Gan, Zhe; Li, Linjie; Li, Chunyuan; Wang, Lijuan; Liu, Zicheng; Gao, Jianfeng

arXiv:2210.09263 (cs)

[Submitted on 17 Oct 2022]

Abstract: This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches into three categories: ($i$) VLP for image-text tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding; ($ii$) VLP for core computer vision tasks, such as (open-set) image classification, object detection, and segmentation; and ($iii$) VLP for video-text tasks, such as video captioning, video-text retrieval, and video question answering. For each category, we present a comprehensive review of state-of-the-art methods, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies. In addition, for each category, we discuss advanced topics being actively explored in the research community, such as big foundation models, unified modeling, in-context few-shot learning, knowledge, robustness, and computer vision in the wild, to name a few.

Comments:	A survey paper/book on Vision-Language Pre-training (102 pages)
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2210.09263 [cs.CV]
	(or arXiv:2210.09263v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2210.09263

Submission history

From: Zhe Gan
[v1] Mon, 17 Oct 2022 17:11:36 UTC (34,554 KB)
